<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <link rel="stylesheet" href="../../aosa.css" type="text/css">
    <title>The Performance of Open Source Software: Zotonic</title>
  </head>
  <body>

    <div class="titlebox">
      <h1>The Performance of Open Source Applications<br>Zotonic</h1>
      <p class="author">Arjan Scherpenisse and Marc Worrell</p>
    </div>

    <h2 id="introduction-to-zotonic">Introduction to Zotonic</h2>

<p>Zotonic is an open source framework for doing full-stack web development, all the way from frontend to backend. Consisting of a small set of core functionalities, it implements a lightweight but extensible Content Management System on top. Zotonic’s main goal is to make it easy to create well-performing websites “out of the box”, so that a website scales well from the start.</p>

<p>While it shares many features and functionalities with web development frameworks like Django, Drupal, Ruby on Rails and WordPress, its main competitive advantage is the language that Zotonic is powered by: Erlang. This language, originally developed for building phone switches, allows Zotonic to be fault tolerant and have great performance characteristics.</p>

<p>Like the title says, this chapter focusses on the performance of Zotonic. We’ll look at the reasons why Erlang was chosen as the programming platform, then inspect the <span class="caps">HTTP</span> request stack, then dive into the caching strategies that Zotonic employs. Finally, we'll describe the optimizations we applied to Zotonic’s submodules and the database.</p>

<h2 id="why-zotonic-why-erlang">Why Zotonic? Why Erlang?</h2>

<p>The first work on Zotonic was started in 2008, and, like many projects, came from “scratching an itch”. Marc Worrell, the main Zotonic architect, had been working for seven years at Mediamatic Lab, in Amsterdam, on a Drupal-like <span class="caps">CMS</span> written in <span class="caps">PHP</span>/MySQL called Anymeta. Anymeta’s main paradigm was that it implemented a “pragmatic approach to the Semantic Web” by modeling everything in the system as generic “things”. Though successful, its implementations suffered from scalability problems.</p>

<p>After Marc left Mediamatic, he spent a few months designing a proper, Anymeta-like <span class="caps">CMS</span> from scratch. The main design goals for Zotonic were that it had to be easy to use for frontend developers; it had to support easy development of real-time web interfaces, simultaneously allowing long-lived connections and many short requests; and it had to have well-defined performance characteristics. More importantly, it had to solve the most common problems that limited performance in earlier Web development approaches–for example, it had to withstand the "Slashdot Effect" (a sudden rush of visitors).</p>

<h3 id="problems-with-the-classic-phpapache-approach">Problems with the Classic <span class="caps">PHP</span>+Apache Approach</h3>

<p>A classic <span class="caps">PHP</span> setup runs as a module inside a container web server like Apache. On each request, Apache decides how to handle the request. When it’s a <span class="caps">PHP</span> request, it spins up <code>mod_php5</code>, and then the <span class="caps">PHP</span> interpreter starts interpreting the script. This comes with startup latency: typically, such a spin-up already takes 5 ms, and then the <span class="caps">PHP</span> code still needs to run. This problem can partially be mitigated by using <span class="caps">PHP</span> accelerators which precompile the <span class="caps">PHP</span> script, bypassing the interpreter. The <span class="caps">PHP</span> startup overhead can also be mitigated by using a process manager like <span class="caps">PHP</span>-<span class="caps">FPM</span>.</p>

<p>Nevertheless, systems like that still suffer from a <em>shared nothing</em> architecture. When a script needs a database connection, it needs to create one itself. The same goes for any other I/O resource that could otherwise be shared between requests. Various modules feature persistent connections to overcome this, but there is no general solution to this problem in <span class="caps">PHP</span>.</p>

<p>Handling long-lived client connections is also hard because such connections need a separate web server thread or process for every request. In the case of Apache and <span class="caps">PHP</span>-<span class="caps">FPM</span>, this does not scale with many concurrent long-lived connections.</p>

<h3 id="requirements-for-a-modern-web-framework">Requirements for a Modern Web Framework</h3>

<p>Modern web frameworks typically deal with three classes of <span class="caps">HTTP</span> request. First, there are dynamically generated pages: dynamically served, usually generated by a template processor. Second, there is static content: small and large files which do not change (e.g., JavaScript, <span class="caps">CSS</span>, and media assets). Third, there are long-lived connections: WebSockets and long-polling requests for adding interactivity and two-way communication to pages.</p>

<p>Before creating Zotonic, we were looking for a software framework and programming language that would allow us to meet our design goals (high performance, developer friendliness) and sidestep the bottlenecks associated with traditional web server systems. Ideally the software would meet the following requirements.</p>

<ul>
<li>Concurrent: it needs to support many concurrent connections that are not limited by the number of unix processes or <span class="caps">OS</span> threads.</li>
<li>Shared resources: it needs to have a mechanism to share resources cheaply (e.g., caching, db connections) between requests.</li>
<li>Hot code upgrades: for ease of development and the enabling of hot-upgrading production systems (keeping downtime to a minimum), it would be nice if code changes could be deployed in a running system, without needing to restart it.</li>
<li>Multi-core <span class="caps">CPU</span> support: a modern system needs to scale over multiple cores, as current CPUs tend to scale in number of cores rather than in clock speed.</li>
<li>Fault tolerant: the system needs to be able to handle exceptional situations, "badly behaving" code, anomalies or resource starvation. Ideally, the system would achieve this by having some kind of supervision mechanism to restart the failing parts.</li>
<li>Distributed: ideally, a system has built-in and easy to set up support for distribution over multiple nodes, to allow for better performance and protection against hardware failure.</li>
</ul>

<h3 id="erlang-to-the-rescue">Erlang to the Rescue</h3>

<p>To our knowledge, Erlang was the only language that met these requirements “out of the box”. The Erlang <span class="caps">VM</span>, combined with its Open Telecom Platform (<span class="caps">OTP</span>), provided the system that gave and continues to give us all the necessary features.</p>

<p>Erlang is a (mostly) functional programming language and runtime system. Erlang/<span class="caps">OTP</span> applications were originally developed for telephone switches, and are known for their fault-tolerance and their concurrent nature. Erlang employs an actor-based concurrency model: each actor is a lightweight “process” (green thread) and the only way to share state between processes is to pass messages. The Open Telecom Platform is the set of standard Erlang libraries which enable fault tolerance and process supervision, amongst others.</p>

<p>Fault tolerance is at the core of its programming paradigm: <em>let it crash</em> is the main philosophy of the system. As processes don’t share any state (to share state, they must send messages to each other), their state is isolated from other processes. As such, a single crashing process will never take down the system. When a process crashes, its supervisor process can decide to restart it.</p>

<p><em>Let it crash</em> also allows you to program for the happy case. Using pattern matching and function guards to assure a sane state means less error handling code is needed, which usually results in clean, concise, and readable code.</p>

<h2 id="zotonics-architecture">Zotonic’s Architecture</h2>

<p>Before we discuss Zotonic’s performance optimizations, let’s have a look at its architecture. <a href="#figure-9.1">Figure 9.1</a> describes Zotonic's most important components.</p>

<div class="center figure">
<a name="figure-9.1"></a><img src="zotonic-images/zotonic-architecture.png" alt="Figure 9.1 - The architecture of Zotonic" title="Figure 9.1 - The architecture of Zotonic" />
</div>

<p class="center figcaption">
<small>Figure 9.1 - The architecture of Zotonic</small>
</p>

<p>The diagram shows the layers of Zotonic that an <span class="caps">HTTP</span> request goes through. For discussing performance issues we’ll need to know what these layers are, and how they affect performance.</p>

<p>First, Zotonic comes with a built in web server, Mochiweb (another Erlang project). It does not require an external web server. This keeps the deployment dependencies to a minimum.<sup><a href="#fn1" class="footnoteRef" id="fnref1">1</a></sup></p>

<p>Like many web frameworks, Zotonic uses a <span class="caps">URL</span> routing system to match requests to controllers. Controllers handle each request in a RESTful way, thanks to the Webmachine library.</p>

<p>Controllers are "dumb" on purpose, without much application-specific logic. Zotonic provides a number of standard controllers which, for the development of basic web applications, are often good enough. For instance, there is a <code>controller_template</code>, whose sole purpose is to reply to <span class="caps">HTTP</span> <span class="caps">GET</span> requests by rendering a given template.</p>

<p>The template language is an Erlang-implementation of the well-known Django Template Language, called ErlyDTL. The general principle in Zotonic is that the templates drive the data requests. The templates decide which data they need, and retrieve it from the models.</p>

<p>Models expose functions to retrieve data from various data sources, like a database. Models expose an <span class="caps">API</span> to the templates, dictating how they can be used. The models are also responsible for caching their results in memory; they decide when and what is cached and for how long. When templates need data, they call a model as if it were a globally available variable.</p>

<p>A model is an Erlang wrapper module which is responsible for certain data. It contains the necessary functions to retrieve and store data in the way that the application needs. For instance, the central model of Zotonic is called <code>m.rsc</code>, which provides access to the generic resource (“page”) data model. Since resources use the database, <code>m_rsc.erl</code> uses a database connection to retrieve its data and pass it through to the template, caching it whenever it can.</p>

<p>This “templates drive the data” approach is different from other web frameworks like Rails and Django, which usually follow a more classical <span class="caps">MVC</span> approach where a controller assigns data to a template. Zotonic follows a less “controller-centric” approach, so that typical websites can be built by just writing templates.</p>


<p>Zotonic uses PostgreSQL for data persistence. <a href="#posa.zotonic.db">Data Model: a Document Database in <span class="caps">SQL</span></a> explains the rationale for this choice.</p>

<h3 id="additional-zotonic-concepts">Additional Zotonic Concepts</h3>

<p>While the main focus of this chapter is the performance characteristics of the web request stack, it is useful to know some of the other concepts that are at the heart of Zotonic.</p>

<dl>
<dt>Virtual hosting</dt>
<dd>A single Zotonic instance typically serves more than one site. It is designed for virtual hosting, including domain aliases and <span class="caps">SSL</span> support. And due to Erlang’s process-isolation, a crashing site does not affect any of the other sites running in the same <span class="caps">VM</span>.
</dd>
<dt>Modules</dt>
<dd>Modules are Zotonic’s way of grouping functionality together. Each module is in its own directory containing Erlang files, templates, assets, etc. They can be enabled on a per-site basis. Modules can hook into the admin system: for instance, the <code>mod_backup</code> module adds version control to the page editor and also runs a daily full database backup. Another module, <code>mod_github</code>, exposes a <code>webhook</code> which pulls, rebuilds and reloads a Zotonic site from GitHub, allowing for continuous deployment.
</dd>
<dt>Notifications</dt>
<dd><p>To enable the loose coupling and extensibility of code, communication between modules and core components is done by a notification mechanism which functions either as a map or as a fold over the observers of a certain named notification. By listening to notifications it becomes easy for a module to override or augment certain behaviour. The calling function decides whether a map or a fold is used. For instance, the <code>admin_menu</code> notification is a fold over the modules, which allows modules to add or remove menu items in the admin menu.</p>
</dd>
<dt>Data model</dt>
<dd><p>The main data model that Zotonic uses can be compared to Drupal’s Node module; “every thing is a thing”. The data model consists of hierarchically categorized resources which connect to other resources using labelled edges. Like its source of inspiration, the Anymeta <span class="caps">CMS</span>, this data model is loosely based on the principles of the Semantic Web.</p>
</dd>
</dl>
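
<p>The difference between the map and fold styles of notification can be illustrated with a small Python sketch. This is not Zotonic's Erlang API: the <code>Notifier</code> class, its method names and the two example observers are hypothetical and only show how the two calling conventions differ.</p>

<pre><code>from collections import defaultdict


class Notifier:
    """Hypothetical registry of observers keyed by notification name."""

    def __init__(self):
        self._observers = defaultdict(list)

    def observe(self, name, callback):
        self._observers[name].append(callback)

    def notify_map(self, name, message):
        # Map: every observer sees the same message; collect all answers.
        return [cb(message) for cb in self._observers[name]]

    def notify_fold(self, name, message, acc):
        # Fold: each observer receives the accumulator returned by the
        # previous one, so it can add to or remove from the result.
        for cb in self._observers[name]:
            acc = cb(message, acc)
        return acc


notifier = Notifier()
# Two imaginary modules that adjust the admin menu.
notifier.observe("admin_menu", lambda msg, menu: menu + ["Backups"])
notifier.observe("admin_menu", lambda msg, menu: [m for m in menu if m != "Comments"])

menu = notifier.notify_fold("admin_menu", None, ["Pages", "Comments"])
print(menu)  # ['Pages', 'Backups']</code></pre>

<p>With a fold, the order of the observers matters, because each one works on the result of the previous one. With a map, the observers are independent of each other.</p>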

<p>Zotonic is an extensible system, and all parts of the system add up when you consider performance. For instance, you might add a module that intercepts web requests, and does something on each request. Such a module might impact the performance of the entire system. In this chapter we’ll leave this out of consideration, and instead focus on the core performance issues.</p>

<h2 id="problem-solving-fighting-the-slashdot-effect">Problem Solving: Fighting the Slashdot Effect</h2>

<p>Most web sites live an unexceptional life in a small place somewhere on the web. That is, until one of their pages hits the front page of a popular website like <span class="caps">CNN</span>, <span class="caps">BBC</span> or Yahoo. In that case, the traffic to the website will likely increase to tens, hundreds, or even thousands of page requests per second in no time.</p>

<p>Such a sudden surge overloads a traditional web server and makes it unreachable. The term "Slashdot Effect" is named after the web site that started this kind of overwhelming referral. Even worse, an overloaded server is sometimes very hard to restart, as the newly started server has empty caches, no database connections, often un-compiled templates, etc.</p>

<p>Many anonymous visitors requesting exactly the same page around the same time shouldn’t be able to overload a server. This problem is easily solved using a caching proxy like Varnish, which caches a static copy of the page and only checks for updates to the page once in a while.</p>

<p>A surge of traffic becomes more challenging when serving dynamic pages for every single visitor; these can't be cached. With Zotonic, we set out to solve this problem.</p>

<p>We realized that most web sites have</p>

<ul>
<li>only a limited number of very popular pages,</li>
<li>a long tail of far less popular pages, and</li>
<li>many shared parts on all pages (menu, most read items, news, etc.).</li>
</ul>

<p>and decided to</p>

<ul>
<li>cache hot data in memory so that no communication is needed to access it,</li>
<li>share renderings of templates and sub-templates between requests and between pages on the web site, and</li>
<li>explicitly design the system to prevent overload on server start and restart.</li>
</ul>

<h3 id="cache-hot-data">Cache Hot Data</h3>

<p>Why fetch data from an external source (database, memcached) when another request fetched it already a couple of milliseconds ago? We always cache simple data requests. In the next section the caching mechanism is discussed in detail.</p>

<h3 id="share-rendered-templates-and-sub-templates-between-pages">Share Rendered Templates and Sub-templates Between Pages</h3>

<p>When rendering a page or included template, a developer can add optional caching directives. This caches the rendered result for a period of time.</p>

<p>Caching starts what we call the <em>memo</em> functionality: while the template is being rendered and one or more processes request the same rendering, the later processes are suspended. When the rendering is done, all waiting processes are sent the rendering result.</p>

<p>The memoization alone–without any further caching–gives a large performance boost by drastically reducing the amount of parallel template processing.</p>
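
<p>The memo idea can be sketched in a few lines of Python, with threads standing in for Erlang processes. The <code>Memo</code> class and the <code>slow_render</code> function are hypothetical names for this illustration, and the sketch deliberately keeps nothing once the rendering has finished, so it shows the memoization without any further caching.</p>

<pre><code>import threading
import time


class Memo:
    """Let one caller compute a value while concurrent callers wait for it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> (done event, one-slot result box)

    def get(self, key, compute):
        with self._lock:
            entry = self._in_flight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), [])
                self._in_flight[key] = entry
        done, box = entry
        if leader:
            try:
                box.append(compute())
            finally:
                with self._lock:
                    del self._in_flight[key]
                done.set()  # wake every waiting caller
        else:
            done.wait()
        if not box:
            raise RuntimeError("rendering failed in the leading caller")
        return box[0]


render_count = 0


def slow_render():
    global render_count
    render_count += 1
    time.sleep(0.3)  # stands in for an expensive template rendering
    return "rendered page"


memo = Memo()
results = []
threads = [
    threading.Thread(target=lambda: results.append(memo.get("home", slow_render)))
    for _ in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(render_count, len(results))</code></pre>

<p>As long as the five requests overlap, the template is rendered once and all five callers receive the same result.</p>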

<h3 id="prevent-overload-on-server-start-or-restart">Prevent Overload on Server Start or Restart</h3>

<p>Zotonic introduces several bottlenecks on purpose. These bottlenecks limit the access to processes that use limited resources or are expensive (in terms of <span class="caps">CPU</span> or memory) to perform. Bottlenecks are currently set up for the template compiler, the image resizing process, and the database connection pool.</p>

<p>The bottlenecks are implemented by having a limited worker pool for performing the requested action. For <span class="caps">CPU</span> or disk intensive work, like image resizing, there is only a single process handling the requests. Requesting processes post their request in the Erlang request queue for the process and wait until their request is handled. If a request times out it will just crash. Such a crashing request will return <span class="caps">HTTP</span> status 503 <em>Service Unavailable</em>.</p>

<p>Waiting processes don’t use many resources and the bottlenecks protect against overload if a template is changed or an image on a hot page is replaced and needs cropping or resizing.</p>

<p>In short: a busy server can still dynamically update its templates, content and images without getting overloaded. At the same time it allows single requests to crash while the system itself continues to operate.</p>
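
<p>A deliberate bottleneck of this kind can be approximated in Python with a single-worker executor. This is a sketch under assumptions, not Zotonic's code: the function names are hypothetical, and where a Zotonic request process would crash on a timeout, the sketch returns the 503 status directly.</p>

<pre><code>import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


# A single worker: resize jobs queue up and run strictly one at a time.
resizer = ThreadPoolExecutor(max_workers=1)


def resize_image(name, seconds):
    time.sleep(seconds)  # stands in for CPU- or disk-intensive work
    return name + " resized"


def handle_request(name, seconds, timeout):
    """Return an (HTTP status, body) pair for one resize request."""
    job = resizer.submit(resize_image, name, seconds)
    try:
        return 200, job.result(timeout=timeout)
    except TimeoutError:
        # The waiting request gives up; the server itself keeps running.
        job.cancel()  # drops the job if it has not started yet
        return 503, "Service Unavailable"


print(handle_request("a.jpg", 0.05, timeout=1.0))
print(handle_request("b.jpg", 0.5, timeout=0.1))</code></pre>

<p>The first request is served normally. The second one gives up after 0.1 seconds and reports a 503, while the worker stays available for later requests.</p>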

<h3 id="the-database-connection-pool">The Database Connection Pool</h3>

<p>One more word on database connections. In Zotonic a process fetches a database connection from a pool of connections for every single query or transaction. This enables many concurrent processes to share a very limited number of database connections. Compare this with most (<span class="caps">PHP</span>) systems where every request holds a connection to the database for the duration of the complete request.</p>

<p>Zotonic closes unused database connections after a time of inactivity. One connection is always left open so that the system can always handle an incoming request or background activity quickly. The dynamic connection pool drastically reduces the number of open database connections on most Zotonic web sites to one or two.</p>
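
<p>The per-query checkout can be illustrated with a minimal Python pool. The <code>ConnectionPool</code> and <code>FakeConnection</code> classes are hypothetical, and the sketch leaves out the closing of idle connections described above. It only shows that a connection is held for the duration of one query rather than one request.</p>

<pre><code>import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hand out a connection per query instead of per request."""

    def __init__(self, connections):
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)

    @contextmanager
    def checkout(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # blocks while all are busy
        try:
            yield conn
        finally:
            self._idle.put(conn)  # returned as soon as the query is done


class FakeConnection:
    """Stand-in for a real database connection."""

    def __init__(self, name):
        self.name = name
        self.queries = 0

    def query(self, sql):
        self.queries += 1
        return self.name + " ran: " + sql


pool = ConnectionPool([FakeConnection("conn-1")])

# Two "requests" each run two queries, yet one connection serves them all.
for request in ("request-a", "request-b"):
    for sql in ("SELECT 1", "SELECT 2"):
        with pool.checkout() as conn:
            print(request, conn.query(sql))</code></pre>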

<h2 id="caching-layers">Caching Layers</h2>

<p>The hardest part of caching is cache invalidation: keeping the cached data fresh and purging stale data. Zotonic uses a central caching mechanism with dependency checks to solve this problem.</p>

<p>This section describes Zotonic’s caching mechanism in a top-down fashion: from the browser down through the stack to the database.</p>

<h3 id="client-side-caching">Client-Side Caching</h3>

<p>The client-side caching is done by the browser. The browser caches images, <span class="caps">CSS</span> and JavaScript files. Zotonic does not allow client-side caching of <span class="caps">HTML</span> pages; it always generates all pages dynamically. It can afford to because it is very efficient in doing so (as described in the previous section), and not caching <span class="caps">HTML</span> pages prevents showing old pages after users log in, log out, or place comments.</p>

<p>Zotonic improves client-side performance in two ways:</p>

<ol style="list-style-type: decimal">
<li>It allows caching of static files (<span class="caps">CSS</span>, JavaScript, images etc.)</li>
<li>It includes multiple <span class="caps">CSS</span> or JavaScript files in a single response</li>
</ol>

<p>The first is done by adding the appropriate <span class="caps">HTTP</span> headers to the response<sup><a href="#fn2" class="footnoteRef" id="fnref2">2</a></sup>:</p>

<pre><code>Last-Modified: Tue, 18 Dec 2012 20:32:56 GMT
Expires: Sun, 01 Jan 2023 14:55:37 GMT
Date: Thu, 03 Jan 2013 14:55:37 GMT
Cache-Control: public, max-age=315360000</code></pre>
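
<p>The relationship between these headers is simple arithmetic: <code>Expires</code> is the <code>Date</code> plus the <code>max-age</code> in seconds. The following Python snippet checks this for the values above; it uses only the standard library and is meant as a worked example, not as the way Zotonic computes the headers.</p>

<pre><code>from datetime import timedelta
from email.utils import format_datetime, parsedate_to_datetime

date = parsedate_to_datetime("Thu, 03 Jan 2013 14:55:37 GMT")
max_age = 315360000  # seconds, from the Cache-Control header

expires = date + timedelta(seconds=max_age)
print(max_age // 86400)                       # 3650 (days)
print(format_datetime(expires, usegmt=True))  # Sun, 01 Jan 2023 14:55:37 GMT</code></pre>

<p>The lifetime is 3650 days, which is two days short of ten calendar years because of the leap days in 2016 and 2020. This explains why the expiry date falls on 1 January rather than 3 January.</p>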

<p>Multiple <span class="caps">CSS</span> or JavaScript files are concatenated into a single file, separating individual files by a tilde and only mentioning paths if they change between files:</p>

<pre><code>http://example.org/lib/bootstrap/css/bootstrap
  ~bootstrap-responsive~bootstrap-base-site~
  /css/jquery.loadmask~z.growl~z.modal~site~63523081976.css</code></pre>

<p>The number at the end is a timestamp of the newest file in the list. The necessary <span class="caps">CSS</span> link or JavaScript script tag is generated using the <code>{% lib %}</code> template tag.</p>
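
<p>To make the combined format concrete, here is a Python sketch that expands such a path (the part after <code>/lib/</code>) back into individual file names. The expansion rules are assumptions inferred from the example URL, and <code>split_combined</code> is a hypothetical helper, not Zotonic's actual implementation.</p>

<pre><code>import posixpath


def split_combined(path):
    """Expand a tilde-combined lib path into (files, timestamp).

    Assumed rules, inferred from the example URL: segments are separated
    by "~", the last segment is "TIMESTAMP.EXT", and a segment only
    carries a directory when it differs from the previous one.
    """
    *names, last = path.split("~")
    timestamp, ext = posixpath.splitext(last)
    files = []
    directory = ""
    for name in names:
        name = name.strip().lstrip("/")
        if "/" in name:
            directory, name = posixpath.split(name)
        files.append(posixpath.join(directory, name) + ext)
    return files, int(timestamp)


combined = (
    "bootstrap/css/bootstrap~bootstrap-responsive~bootstrap-base-site~"
    "/css/jquery.loadmask~z.growl~z.modal~site~63523081976.css"
)
files, newest = split_combined(combined)
for f in files:
    print(f)
print(newest)</code></pre>

<p>Under these assumptions the example expands to three files in <code>bootstrap/css</code> and four files in <code>css</code>, all with the <code>.css</code> extension.</p>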

<h3 id="server-side-caching">Server-Side Caching</h3>

<p>Zotonic is a large system, and many parts in it do caching in some way. The sections below explain some of the more interesting parts.</p>

<h3 id="static-css-js-and-image-files">Static <span class="caps">CSS</span>, <span class="caps">JS</span> and Image Files</h3>

<p>The controller handling the static files has some optimizations for handling these files. It can decompose combined file requests into a list of individual files.</p>

<p>The controller has checks for the <code>If-Modified-Since</code> header, serving the <span class="caps">HTTP</span> status 304 <em>Not Modified</em> when appropriate.</p>

<p>On the first request it will concatenate the contents of all the static files into one byte array (an Erlang <em>binary</em>).<sup><a href="#fn3" class="footnoteRef" id="fnref3">3</a></sup> This byte array is then cached in the central depcache (see <a href="#posa.zotonic.depcache">Depcache</a>) in two forms: compressed (with gzip) and uncompressed. Depending on the <code>Accept-Encoding</code> headers sent by the browser, Zotonic will serve either the compressed or uncompressed version.</p>
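
<p>The idea of storing both representations and choosing one per request can be sketched as follows. The function names are hypothetical, and the header parsing is simplified: it ignores quality values such as <code>gzip;q=0</code>, which a real server would have to honor.</p>

<pre><code>import gzip


def build_cache_entry(parts):
    """Concatenate file contents once and keep both representations."""
    body = b"".join(parts)
    return {"identity": body, "gzip": gzip.compress(body)}


def select_body(entry, accept_encoding):
    """Pick the stored variant based on the Accept-Encoding request header."""
    offered = [item.split(";")[0].strip() for item in accept_encoding.split(",")]
    if "gzip" in offered:
        return {"Content-Encoding": "gzip"}, entry["gzip"]
    return {}, entry["identity"]


entry = build_cache_entry([b"body { margin: 0 }\n", b"h1 { color: red }\n"])
headers, body = select_body(entry, "gzip, deflate")
print(headers, gzip.decompress(body) == entry["identity"])</code></pre>

<p>Compressing once at cache-fill time trades a little memory for not having to run gzip on every request.</p>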

<p>This caching mechanism is efficient enough that its performance is similar to many caching proxies, while still fully controlled by the web server. With an earlier version of Zotonic and on simple hardware (quad core 2.4 GHz Xeon from 2008) we saw throughputs of around 6000 requests/second and were able to saturate a gigabit ethernet connection requesting a small (~20 <span class="caps">KB</span>) image file.</p>

<h3 id="rendered-templates">Rendered Templates</h3>

<p>Templates are compiled into Erlang modules, after which the byte code is kept in memory. Compiled templates are called as regular Erlang functions.</p>

<p>The template system detects any changes to templates and will recompile the template during runtime. When compilation is finished Erlang’s hot code upgrade mechanism is used to load the newly compiled Erlang module.</p>

<p>The main page and template controllers have options to cache the template rendering result. Caching can also be enabled only for anonymous (not logged in) visitors. For most websites, anonymous visitors generate the bulk of all requests, and the pages they see are not personalized and are (almost) identical. Note that the template rendering result is an intermediate result and not the final <span class="caps">HTML</span>. This intermediate result contains (among others) untranslated strings and JavaScript fragments. The final <span class="caps">HTML</span> is generated by parsing this intermediate structure, picking the correct translations and collecting all JavaScript.</p>

<p>The concatenated JavaScript, along with a unique page <span class="caps">ID</span>, is placed at the position of the <code>{% script %}</code> template tag. This should be just above the closing <code>&lt;/body&gt;</code> tag. The unique page <span class="caps">ID</span> is used to match this rendered page with the handling Erlang processes and for WebSocket/Comet interaction on the page.</p>

<p>Like with any template language, templates can include other templates. In Zotonic, included templates are usually compiled inline to eliminate any performance lost by using included files.</p>

<p>Special options can force runtime inclusion. One of those options is caching. Caching can be enabled for anonymous visitors only, a caching period can be set, and cache dependencies can be added. These cache dependencies are used to invalidate the cached rendering if any of the shown resources is changed.</p>

<p>Another method to cache parts of templates is to use the <code>{% cache %} ... {% endcache %}</code> block tag, which caches a part of a template for a given amount of time. This tag has the same caching options as the include tag, but has the advantage that it can easily be added in existing templates.</p>

<h3 id="in-memory-caching">In-Memory Caching</h3>

<p>All caching is done in memory, in the Erlang <span class="caps">VM</span> itself. No communication between computers or operating system processes is needed to access the cached data. This greatly simplifies and optimizes the use of the cached data.</p>

<p>As a comparison, accessing a memcache server typically takes 0.5 milliseconds. In contrast, accessing main memory within the same process takes 1 nanosecond on a <span class="caps">CPU</span> cache hit and 100 nanoseconds on a <span class="caps">CPU</span> cache miss–not to mention the huge speed difference between memory and network.<sup><a href="#fn4" class="footnoteRef" id="fnref4">4</a></sup></p>

<p>Zotonic has two in-memory caching mechanisms:<sup><a href="#fn5" class="footnoteRef" id="fnref5">5</a></sup></p>

<ol style="list-style-type: decimal">
<li>Depcache, the central per-site cache</li>
<li>Process Dictionary Memo Cache</li>
</ol>

<h3 id="depcache">Depcache</h3>

<p><a name="posa.zotonic.depcache"> </a></p>

<p>The central caching mechanism in every Zotonic site is the <em>depcache</em>, which is short for <em>dep</em>endency <em>cache</em>. The depcache is an in-memory key-value store with a list of dependencies for every stored key.</p>

<p>For every key in the depcache we store:</p>

<ul>
<li>the key’s value;</li>
<li>a serial number, a global integer incremented with every update request;</li>
<li>the key’s expiration time (counted in seconds);</li>
<li>a list of other keys that this key depends on (e.g., a resource <span class="caps">ID</span> displayed in a cached template); and</li>
<li>if the key is still being calculated, a list of processes waiting for the key’s value.</li>
</ul>

<p>If a key is requested then the cache checks if the key is present, not expired, and if the serial numbers of all the dependency keys are lower than the serial number of the cached key. If the key is still valid its value is returned; otherwise the key and its value are removed from the cache and <code>undefined</code> is returned.</p>

<p>Alternatively, if the key is still being calculated, the requesting process is added to the waiting list of the key.</p>
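
<p>The serial-number check can be made concrete with a small Python model. This is a simplified sketch, not the Erlang implementation: the <code>Depcache</code> class below is hypothetical, uses plain dictionaries, leaves out the waiting lists, and returns <code>None</code> where Zotonic returns <code>undefined</code>.</p>

<pre><code>import itertools
import time


class Depcache:
    """Key-value cache whose entries are invalidated via dependency serials."""

    def __init__(self):
        self._serial = itertools.count(1)  # global update counter
        self._meta = {}  # key -> (expire time, serial, dependency keys)
        self._deps = {}  # key -> serial of the latest update or flush
        self._data = {}  # key -> cached value

    def set(self, key, value, ttl, deps=()):
        serial = next(self._serial)
        self._meta[key] = (time.monotonic() + ttl, serial, tuple(deps))
        self._deps[key] = serial
        self._data[key] = value

    def flush(self, key):
        # Bumping the serial invalidates every entry that depends on key.
        self._deps[key] = next(self._serial)
        self._meta.pop(key, None)
        self._data.pop(key, None)

    def get(self, key):
        meta = self._meta.get(key)
        if meta is None:
            return None
        expire, serial, deps = meta
        fresh = expire > time.monotonic() and all(
            serial > self._deps.get(dep, 0) for dep in deps
        )
        if not fresh:
            self._meta.pop(key, None)
            self._data.pop(key, None)
            return None
        return self._data[key]


cache = Depcache()
cache.set("rsc:1", {"title": "Hello"}, ttl=3600)
cache.set("rendered:home", "...html...", ttl=3600, deps=["rsc:1"])
print(cache.get("rendered:home"))  # still valid
cache.flush("rsc:1")               # the resource was edited
print(cache.get("rendered:home"))  # None: invalidated through its dependency</code></pre>

<p>Because invalidation only compares integers at lookup time, changing a resource does not require searching the cache for everything that depends on it.</p>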

<p>The implementation makes use of <span class="caps">ETS</span>, the Erlang Term Storage, a standard hash table implementation which is part of the Erlang <span class="caps">OTP</span> distribution. The following <span class="caps">ETS</span> tables are created by Zotonic for the depcache:</p>

<ul>
<li>Meta table: the <span class="caps">ETS</span> table holding all stored keys, the expiration and the depending keys. A record in this table is written as <code>#meta{key, expire, serial, deps}</code>.</li>
<li>Deps table: the <span class="caps">ETS</span> table that stores the serial for each key.</li>
<li>Data table: the <span class="caps">ETS</span> table that stores each key's data.</li>
<li>Waiting PIDs dictionary: the <span class="caps">ETS</span> table that stores the IDs of all processes waiting for the arrival of a key’s value.</li>
</ul>

<p>The <span class="caps">ETS</span> tables are optimized for parallel reads and usually directly accessed by the calling process. This prevents any communication between the calling process and the depcache process.</p>

<p>The depcache process is called for:</p>

<ul>
<li>memoization where processes wait for another process’s value to be calculated;</li>
<li><em>put</em> (store) requests, serializing the serial number increments; and</li>
<li>delete requests, also serializing the depcache access.</li>
</ul>

<p>The depcache can get quite large. To prevent it from growing too large there is a garbage collector process. The garbage collector slowly iterates over the complete depcache, evicting expired or invalidated keys. If the depcache size is above a certain threshold (100 MiB by default) then the garbage collector speeds up and evicts 10% of all encountered items. It keeps evicting until the cache is below its threshold size.</p>

<p>100 MiB might sound small in this era of multi-<span class="caps">TB</span> databases. However, as the cache mostly contains textual data it will be big enough to contain the hot data for most web sites. Otherwise the size of the cache can be changed in configuration.</p>

<h3 id="process-dictionary-memo-cache">Process Dictionary Memo Cache</h3>

<p>The other memory-caching paradigm in Zotonic is the process dictionary memo cache. As described earlier, the data access patterns are dictated by the templates. The caching system uses simple heuristics to optimize access to data.</p>

<p>Important in this optimization is data caching in the Erlang process dictionary of the process handling the request. The process dictionary is a simple key-value store in the same heap as the process. Basically, it adds state to the functional Erlang language. Use of the process dictionary is usually frowned upon for this reason, but for in-process caching it is useful.</p>

<p>When a resource is accessed (remember, a resource is the central data unit of Zotonic), it is copied into the process dictionary. The same is done for computational results–like access control checks–and other data like configuration values.</p>

<p>Every property of a resource–like its title, summary or body text–must, when shown on a page, perform an access control check and then fetch the requested property from the resource. Caching all the resource’s properties and its access checks greatly speeds up resource data usage and removes many drawbacks of the hard-to-predict data access patterns by templates.</p>

<p>As a page or process can use a lot of data this memo cache has a couple of pressure valves:</p>

<ol style="list-style-type: decimal">
<li>When holding more than 10,000 keys the whole process dictionary is flushed. This prevents process dictionaries holding many unused items, like what happens when looping through long lists of resources. Special Erlang variables like <code>$ancestors</code> are kept.</li>
<li>The memo cache must be programmatically enabled. This is automatically done for every incoming <span class="caps">HTTP</span> or WebSocket request and template rendering.</li>
<li>Between <span class="caps">HTTP</span>/WebSocket requests the process dictionary is flushed, as multiple sequential <span class="caps">HTTP</span>/WebSocket requests share the same process.</li>
<li>The memo cache doesn’t track dependencies. Any depcache deletion will also flush the complete process dictionary of the process performing the deletion.</li>
</ol>

<p>When the memo cache is disabled then every lookup is handled by the depcache. This results in a call to the depcache process and data copying between the depcache and the requesting process.</p>
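<p>The memo cache and its pressure valves can be sketched in a few lines of Erlang. This is a minimal illustration only, not Zotonic’s actual code: the module name <code>memo_sketch</code>, the <code>{memo, Key}</code> key layout and all function names are assumptions made for this example. Instead of flushing the whole dictionary and restoring the special keys, the sketch erases only its own entries, which has the same effect.</p>

<pre><code>-module(memo_sketch).
-export([enable/0, lookup/2, flush/0]).

%% Hypothetical marker key and limit, chosen for this example.
-define(ENABLED, memo_sketch_enabled).
-define(MAX_KEYS, 10000).

%% Switch memoization on for the calling process.
enable() -&gt;
    put(?ENABLED, true),
    ok.

%% Return the memoized value for Key, calling Fun on a miss.
lookup(Key, Fun) when is_function(Fun, 0) -&gt;
    case get(?ENABLED) of
        true -&gt;
            case get({memo, Key}) of
                undefined -&gt;
                    maybe_flush(),
                    Value = Fun(),
                    put({memo, Key}, {ok, Value}),
                    Value;
                {ok, Value} -&gt;
                    Value
            end;
        _ -&gt;
            %% Disabled: always take the slow path.
            Fun()
    end.

%% Erase the memoized entries; other keys such as '$ancestors' stay.
flush() -&gt;
    _ = [erase(K) || {{memo, _} = K, _} &lt;- get()],
    ok.

%% Pressure valve: flush when the dictionary has grown too large.
maybe_flush() -&gt;
    case length(get()) &gt; ?MAX_KEYS of
        true -&gt; flush();
        false -&gt; ok
    end.
</code></pre>

<p>In this sketch <code>Fun</code> stands in for the slow path, such as a depcache lookup or an access control check. A request handler would call <code>enable/0</code> at the start of a request and <code>flush/0</code> at the end.</p>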

<h2 id="the-erlang-virtual-machine">The Erlang Virtual Machine</h2>

<p>The Erlang Virtual Machine has a few properties that are important when looking at performance.</p>

<h3 id="processes-are-cheap">Processes are Cheap</h3>

<p>The Erlang <span class="caps">VM</span> is specifically designed to do many things in parallel, and as such has its own implementation of multiprocessing within the <span class="caps">VM</span>. Erlang processes are scheduled on a reduction count basis, where one reduction is roughly equivalent to a function call. A process is allowed to run until it pauses to wait for input (a message from some other process) or until it has executed a fixed number of reductions. For each <span class="caps">CPU</span> core, a scheduler is started with its own run queue. It is not uncommon for Erlang applications to have thousands to millions of processes alive in the <span class="caps">VM</span> at any given point in time.</p>

<p>Processes are not only cheap to start but also cheap in memory at 327 words per process, which amounts to ~2.5 KiB (327 × 8 bytes) on a 64-bit machine.<sup><a href="#fn6" class="footnoteRef" id="fnref6">6</a></sup> This compares to ~500 KiB for Java and a default of 2 MiB for pthreads.</p>

<p>Since processes are so cheap to use, any processing that is not needed for a request’s result is spawned off into a separate process. Sending an email or logging are both examples of tasks that could be handled by separate processes.</p>
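<p>A minimal sketch of this pattern follows. The module and function names are hypothetical, and the “notification” is a stand-in for slow work such as sending an email; the point is only that the reply is returned without waiting for the spawned process.</p>

<pre><code>-module(spawn_sketch).
-export([handle_request/1]).

%% Build the reply and hand the follow-up work to a short-lived process.
handle_request(Request) -&gt;
    Reply = {ok, Request},
    %% The closure, including Request, is copied into the new process.
    _Pid = spawn(fun() -&gt; send_notification(Request) end),
    Reply.

%% Stand-in for slow work that the reply does not depend on.
send_notification(Request) -&gt;
    timer:sleep(100),
    io:format("notified about ~p~n", [Request]).
</code></pre>

<p>If the spawned process crashes, the request process is not affected, because <code>spawn/1</code> does not link the two processes.</p>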

<h3 id="data-copying-is-expensive">Data Copying is Expensive</h3>

<p>In the Erlang <span class="caps">VM</span>, messages between processes are relatively expensive, as each message is copied into the heap of the receiving process. This copying is needed because every process has its own heap and garbage collector. Preventing data copying is important, which is why Zotonic’s depcache uses <span class="caps">ETS</span> tables, which can be accessed from any process.</p>
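<p>The following sketch shows a process reading from a shared <span class="caps">ETS</span> table directly, without a request and reply round trip through a process that owns the data. The table name, key and module name are assumptions for this example; it is not the depcache implementation.</p>

<pre><code>-module(ets_sketch).
-export([start/0, read_from_other_process/1]).

%% Create a public, named table and store one value in it.
%% Calling this twice from a live owner fails, as the name is taken.
start() -&gt;
    _ = ets:new(ets_sketch_cache,
                [named_table, public, set, {read_concurrency, true}]),
    true = ets:insert(ets_sketch_cache, {site_title, &lt;&lt;"My Site"&gt;&gt;}),
    ok.

%% Look up Key from a freshly spawned process and return what it found.
read_from_other_process(Key) -&gt;
    Parent = self(),
    Ref = make_ref(),
    spawn(fun() -&gt;
        %% Direct table read: no message to a cache-owning process.
        Result = ets:lookup(ets_sketch_cache, Key),
        Parent ! {Ref, Result}
    end),
    receive
        {Ref, Result} -&gt; Result
    after 1000 -&gt;
        timeout
    end.
</code></pre>

<p>The message back to the parent exists only so that the example can show the result; in a request handler the lookup result would simply be used in place.</p>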

<h4 id="separate-heap-for-bigger-byte-arrays">Separate Heap for Bigger Byte Arrays</h4>

<p>There is one big exception to copying data between processes: byte arrays larger than 64 bytes are not copied. They have their own heap and are separately garbage collected.</p>

<p>This makes it cheap to send a big byte array between processes, as only a reference to the byte array is copied. However, it does make garbage collection harder, as all references must be garbage collected before the byte array can be freed.</p>

<p>Sometimes, references to parts of a big byte array are passed: the bigger byte array can’t be garbage collected until the reference to the smaller part is garbage collected. A consequence is that copying a byte array is an optimization if that frees up the bigger byte array.</p>
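<p>The copy-to-release optimization can be sketched with the standard <code>binary:copy/1</code> function. The module and function names are made up for this example.</p>

<pre><code>-module(binary_sketch).
-export([keep_prefix/2]).

%% Return the first N bytes of Bin as an independent byte array, so that
%% the (possibly much larger) Bin can be garbage collected.
keep_prefix(Bin, N) when is_binary(Bin), is_integer(N),
                         N &gt;= 0, N =&lt; byte_size(Bin) -&gt;
    &lt;&lt;Prefix:N/binary, _/binary&gt;&gt; = Bin,
    %% Prefix may still reference the storage of Bin at this point;
    %% the copy below has its own storage.
    binary:copy(Prefix).
</code></pre>

<p>To check whether a small byte array keeps a larger one alive, <code>binary:referenced_byte_size/1</code> reports the size of the underlying storage, which can be compared with <code>byte_size/1</code>.</p>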

<h3 id="string-processing-is-expensive">String Processing is Expensive</h3>

<p>String processing in any functional language can be expensive because strings are often represented as linked lists of integers, and, due to the functional nature of Erlang, data cannot be destructively updated.</p>

<p>If a string is represented as a list, then it is processed using tail recursive functions and pattern matching. This makes it a natural fit for functional languages. The problem is that the data representation of a linked list has a big overhead and that messaging a list to another process always involves copying the full data structure. This makes a list a non-optimal choice for strings.</p>

<p>Erlang has its own middle-of-the-road answer to strings: io-lists. Io-lists are nested lists containing lists, integers (single byte value), byte arrays and references to parts of other byte arrays. Io-lists are extremely easy to use and appending, prefixing or inserting data is inexpensive, as they only need changes to relatively short lists, without any data copying.<sup><a href="#fn7" class="footnoteRef" id="fnref7">7</a></sup></p>

<p>An io-list can be sent as-is to a “port” (a file descriptor), which flattens the data structure to a byte stream and sends it to a socket.</p>

<p>Example of an io-list:</p>

<pre><code> [ &lt;&lt;&quot;Hello&quot;&gt;&gt;, 32, [ &lt;&lt;&quot;Wo&quot;&gt;&gt;, [114, 108], &lt;&lt;&quot;d&quot;&gt;&gt; ] ].</code></pre>

<p>Which flattens to the byte array:</p>

<pre><code> &lt;&lt;&quot;Hello World&quot;&gt;&gt;.</code></pre>
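<p>The flattening can be tried out with the built-in <code>iolist_to_binary/1</code>. The small module below is only a wrapper for experimenting; its name and functions are assumptions for this example.</p>

<pre><code>-module(iolist_sketch).
-export([hello/0, flatten/1]).

%% A nested io-list of byte arrays, a single byte and a short list.
hello() -&gt;
    [&lt;&lt;"Hello"&gt;&gt;, 32, [&lt;&lt;"Wo"&gt;&gt;, [114, 108], &lt;&lt;"d"&gt;&gt;]].

%% Flatten any io-list into one byte array.
flatten(IoList) -&gt;
    iolist_to_binary(IoList).
</code></pre>

<p>Flattening is rarely needed in practice, since ports accept io-lists directly. The built-in <code>iolist_size/1</code> gives the byte length without building the flattened byte array, which is useful for headers such as <code>Content-Length</code>.</p>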

<p>Interestingly, most string processing in a web application consists of:</p>

<ol style="list-style-type: decimal">
<li>Concatenating data (dynamic and static) into the resulting page.</li>
<li><span class="caps">HTML</span> escaping and sanitizing content values.</li>
</ol>

<p>Erlang’s io-list is the perfect data structure for the first use case. And the second use case is resolved by an aggressive sanitization of all content <em>before</em> it is stored in the database.</p>

<p>These two combined mean that for Zotonic a rendered page is just a big concatenation of byte arrays and pre-sanitized values in a single io-list.</p>
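<p>A rough sketch of both halves follows. Plain <span class="caps">HTML</span> escaping stands in here for the much more thorough sanitization described above, and the module, function names and page layout are assumptions for this example.</p>

<pre><code>-module(render_sketch).
-export([escape/1, page/2]).

%% Escape HTML special characters once, before the value is stored.
escape(Bin) when is_binary(Bin) -&gt;
    &lt;&lt; &lt;&lt;(escape_char(C))/binary&gt;&gt; || &lt;&lt;C&gt;&gt; &lt;= Bin &gt;&gt;.

escape_char($&lt;) -&gt; &lt;&lt;"&amp;lt;"&gt;&gt;;
escape_char($&gt;) -&gt; &lt;&lt;"&amp;gt;"&gt;&gt;;
escape_char($&amp;) -&gt; &lt;&lt;"&amp;amp;"&gt;&gt;;
escape_char($") -&gt; &lt;&lt;"&amp;quot;"&gt;&gt;;
escape_char(C)  -&gt; &lt;&lt;C&gt;&gt;.

%% Render time: only concatenation of static parts and stored values.
page(Title, Body) -&gt;
    [&lt;&lt;"&lt;h1&gt;"&gt;&gt;, Title, &lt;&lt;"&lt;/h1&gt;&lt;p&gt;"&gt;&gt;, Body, &lt;&lt;"&lt;/p&gt;"&gt;&gt;].
</code></pre>

<p>The cost of <code>escape/1</code> is paid once per write instead of once per page view, and <code>page/2</code> builds a five-element list without copying <code>Title</code> or <code>Body</code>.</p>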

<h3 id="implications-for-zotonic">Implications for Zotonic</h3>

<p>Zotonic makes heavy use of a relatively big data structure, the <em>Context</em>. This is a record containing all data needed for a request evaluation. It contains:</p>

<ul>
<li>The request data: headers, request arguments, body data etc.</li>
<li>Webmachine status</li>
<li>User information (e.g., user <span class="caps">ID</span>, access control information)</li>
<li>Language preference</li>
<li><code>User-Agent</code> class (e.g., text, phone, tablet, desktop)</li>
<li>References to special site processes (e.g., notifier, depcache, etc.)</li>
<li>Unique <span class="caps">ID</span> for the request being processed (this will become the page <span class="caps">ID</span>)</li>
<li>Session and page process IDs</li>
<li>Database connection process during a transaction</li>
<li>Accumulators for reply data (e.g., data, actions to be rendered, JavaScript files)</li>
</ul>

<p>All this data can make a large data structure. Sending this large Context to different processes working on the request would result in a substantial data copying overhead.</p>

<p>That is why we try to do most of the request processing in a single process: the Mochiweb process that accepted the request. Additional modules and extensions are called using function calls instead of using inter-process messages.</p>

<p>Sometimes an extension is implemented using a separate process. In that case the extension provides a function accepting the Context and the process <span class="caps">ID</span> of the extension process. This interface function is then responsible for efficiently messaging the extension process.</p>

<p>Zotonic also needs to send a message when rendering cacheable sub-templates. In this case the Context is pruned of all intermediate template results and some other unneeded data (like logging information) before the Context is messaged to the process rendering the sub-template.</p>
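<p>The pruning step can be illustrated with a much reduced, hypothetical Context record. The field names below are assumptions for this example and do not match Zotonic’s real record.</p>

<pre><code>-module(context_sketch).
-export([new/0, prune_for_spawn/1]).

%% Hypothetical stand-in for the Context record.
-record(context, {
    user_id = undefined,
    language = en,
    depcache = undefined,
    render_acc = [],   % intermediate template results
    log = []           % logging information
}).

%% A context as it might look halfway through a request.
new() -&gt;
    #context{user_id = 1,
             render_acc = [&lt;&lt;"large intermediate output"&gt;&gt;],
             log = [some_log_entry]}.

%% Drop the bulky accumulators before the record is sent in a message.
prune_for_spawn(#context{} = Context) -&gt;
    Context#context{render_acc = [], log = []}.
</code></pre>

<p>Only the pruned record would be sent to the rendering process, so the cost of the copy depends on the small fields that remain.</p>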

<p>We don’t care too much about messaging byte arrays as they are, in most cases, larger than 64 bytes and as such will not be copied between processes.</p>

<p>For serving large static files, there is the option of using the Linux <code>sendfile()</code> system call to delegate sending the file to the operating system.</p>

<h2 id="changes-to-the-webmachine-library">Changes to the Webmachine Library</h2>

<p>Webmachine is a library implementing an abstraction of the <span class="caps">HTTP</span> protocol. It is implemented on top of the Mochiweb library which implements the lower level <span class="caps">HTTP</span> handling, like acceptor processes, header parsing, etc.</p>

<p>Controllers are made by creating Erlang modules implementing callback functions. Examples of callback functions are <code>resource_exists</code>, <code>previously_existed</code>, <code>authorized</code>, <code>allowed_methods</code>, <code>process_post</code>, etc. Webmachine also matches request paths against a list of dispatch rules, assigning request arguments and selecting the correct controller for handling the <span class="caps">HTTP</span> request.</p>
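<p>A minimal controller might look like the sketch below. The callback shapes follow stock Webmachine conventions, where each callback receives the request data and a state and returns a three-element tuple. The module name is hypothetical, and Zotonic’s modified Webmachine may differ in the details.</p>

<pre><code>-module(hello_controller).
-export([init/1, allowed_methods/2, resource_exists/2,
         content_types_provided/2, to_html/2]).

%% Called once per request; returns the initial controller state.
init(_Args) -&gt;
    {ok, undefined}.

allowed_methods(ReqData, State) -&gt;
    {['GET', 'HEAD'], ReqData, State}.

%% Returning false here would lead to a 404 response.
resource_exists(ReqData, State) -&gt;
    {true, ReqData, State}.

%% Map a content type to the function that renders it.
content_types_provided(ReqData, State) -&gt;
    {[{"text/html", to_html}], ReqData, State}.

to_html(ReqData, State) -&gt;
    {&lt;&lt;"&lt;h1&gt;Hello&lt;/h1&gt;"&gt;&gt;, ReqData, State}.
</code></pre>

<p>Callbacks that are not exported fall back to defaults, so a controller only spells out the decisions that differ from the default <span class="caps">HTTP</span> behaviour.</p>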

<p>With Webmachine, handling the <span class="caps">HTTP</span> protocol becomes easy. We decided early on to build Zotonic on top of Webmachine for this reason.</p>

<p>While building Zotonic a couple of problems with Webmachine were encountered.</p>

<ol style="list-style-type: decimal">
<li>When we started, it supported only a single list of dispatch rules, not a list of rules per host (i.e., site).</li>
<li>Dispatch rules are set in the application environment, and copied to the request process when dispatching.</li>
<li>Some callback functions (like <code>last_modified</code>) are called multiple times during request evaluation.</li>
<li>When Webmachine crashes during request evaluation no log entry is made by the request logger.</li>
<li>No support for <span class="caps">HTTP</span> Upgrade, making WebSockets support harder.</li>
</ol>

<p>The first problem (no partitioning of dispatch rules) is only a nuisance. It makes the list of dispatch rules less intuitive and more difficult to interpret.</p>

<p>The second problem (copying the dispatch list for every request) turned out to be a show stopper for Zotonic. The list could become so large that copying it could take the majority of the time needed to handle a request.</p>

<p>The third problem (multiple calls to the same functions) forced controller writers to implement their own caching mechanisms, which is error prone.</p>

<p>The fourth problem (no log on crash) makes it harder to see problems when in production.</p>

<p>The fifth problem (no <span class="caps">HTTP</span> Upgrade) prevents us from using the nice abstractions available in Webmachine for WebSocket connections.</p>

<p>The above problems were so serious that we had to modify Webmachine for our own purposes.</p>

<p>First a new option was added: dispatcher. A dispatcher is a module implementing the <code>dispatch/3</code> function which matches a request to a dispatch list. The dispatcher also selects the correct site (virtual host) using the <span class="caps">HTTP</span> <code>Host</code> header. When testing a simple “hello world” controller, these changes gave a threefold increase of throughput. We also observed that the gain was much higher on systems with many virtual hosts and dispatch rules.</p>
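<p>The idea of a per-host dispatcher can be sketched as follows. Everything in this example is an assumption: the argument order and return values of <code>dispatch/3</code>, the rule format, the host names and the controller names are invented for illustration and are not the interface of Webmachine or Zotonic.</p>

<pre><code>-module(dispatcher_sketch).
-export([dispatch/3]).

%% Hypothetical per-host rules: {PathPattern, Controller}.
%% Atoms in a pattern bind a path segment to a request argument.
rules(&lt;&lt;"blog.example.com"&gt;&gt;) -&gt; [{[&lt;&lt;"post"&gt;&gt;, id], blog_post_controller}];
rules(&lt;&lt;"shop.example.com"&gt;&gt;) -&gt; [{[&lt;&lt;"item"&gt;&gt;, id], shop_item_controller}];
rules(_) -&gt; [].

%% Select the site by Host, then match Path against that site's rules.
dispatch(Host, Path, ReqData) -&gt;
    Segments = binary:split(Path, &lt;&lt;"/"&gt;&gt;, [global, trim_all]),
    case match(rules(Host), Segments) of
        {ok, Controller, Args} -&gt; {Controller, Args, ReqData};
        nomatch -&gt; {no_dispatch_match, Host, Path}
    end.

match([], _Segments) -&gt;
    nomatch;
match([{Pattern, Controller} | Rest], Segments) -&gt;
    case bind(Pattern, Segments, []) of
        {ok, Args} -&gt; {ok, Controller, Args};
        nomatch -&gt; match(Rest, Segments)
    end.

%% Walk pattern and path together, collecting bound arguments.
bind([], [], Acc) -&gt;
    {ok, lists:reverse(Acc)};
bind([Name | Ps], [Seg | Ss], Acc) when is_atom(Name) -&gt;
    bind(Ps, Ss, [{Name, Seg} | Acc]);
bind([Seg | Ps], [Seg | Ss], Acc) -&gt;
    bind(Ps, Ss, Acc);
bind(_, _, _) -&gt;
    nomatch.
</code></pre>

<p>Because the rules in this sketch are compiled into the module, looking them up does not copy a rule list from the application environment into the request process.</p>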

<p>Webmachine maintains two data structures, one for the request data and one for the internal request processing state. These data structures were referring to each other and actually were almost always used in tandem, so we combined them in a single data structure. This made it easier to remove the use of the process dictionary and to pass the new single data structure as an argument to all functions inside Webmachine. The change resulted in 20% less processing time per request.</p>

<p>We optimized Webmachine in many other ways that we will not describe in detail here, but the most important points are:</p>

<ul>
<li>Return values of some controller callbacks are cached (<code>charsets_provided</code>, <code>content_types_provided</code>, <code>encodings_provided</code>, <code>last_modified</code>, and <code>generate_etag</code>).</li>
<li>More process dictionary use was removed (less global state, clearer code, easier testing).</li>
<li>Separate logger process per request; even when a request crashes we have a log up to the point of the crash.</li>
<li>An <span class="caps">HTTP</span> Upgrade callback was added as a step after the <em>forbidden</em> access check to support WebSockets.</li>
<li>Originally, a controller was called a “resource”. We changed it to “controller” to make a clear distinction between the (data-)resources being served and the code serving those resources.</li>
<li>Some instrumentation was added to measure request speed and size.</li>
</ul>

<h2 id="data-model-a-document-database-in-sql">Data Model: a Document Database in <span class="caps">SQL</span></h2>

<p><a name="posa.zotonic.db"> </a></p>

<p>From a data perspective it is worth mentioning that all properties of a “resource” (Zotonic’s main data unit) are serialized into a binary blob; “real” database columns are only used for keys, querying and foreign key constraints.</p>
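<p>One way to picture this layout is the sketch below, which keeps the <span class="caps">ID</span> as a separate value and serializes the remaining properties with Erlang’s external term format. The use of <code>term_to_binary/1</code> is an assumption for this example, as are the module and function names; the text above does not specify the serialization format.</p>

<pre><code>-module(props_sketch).
-export([to_row/2, from_row/1]).

%% The id stays a "real" column; all other properties become one blob.
to_row(Id, Props) when is_integer(Id), is_list(Props) -&gt;
    {Id, term_to_binary(Props)}.

%% Restore the property list when a row is read back.
from_row({Id, Blob}) when is_binary(Blob) -&gt;
    {Id, binary_to_term(Blob)}.
</code></pre>

<p>Adding a new property then needs no schema change, at the price that the database cannot query inside the blob. That gap is what separate index fields are for.</p>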

<p>Separate “pivot” fields and tables are added for properties, or combinations of properties that need indexing, like full text columns, date properties, etc.</p>

<p>When a resource is updated, a database trigger adds the resource’s <span class="caps">ID</span> to the pivot queue. This pivot queue is consumed by a separate Erlang background process which indexes batches of resources at a time in a single transaction.</p>

<p>Choosing <span class="caps">SQL</span> made it possible for us to hit the ground running: PostgreSQL has a well known query language, great stability, known performance, excellent tools, and both commercial and non-commercial support.</p>

<p>Beyond that, the database is not the limiting performance factor in Zotonic. If a query becomes the bottleneck, then it is the task of the developer to optimize that particular query using the database’s query analyzer.</p>

<p>Finally, the golden performance rule for working with any database is: Don’t hit the database; don’t hit the disk; don’t hit the network; hit your cache.</p>

<h2 id="benchmarks-statistics-and-optimizations">Benchmarks, Statistics and Optimizations</h2>

<p>We don’t believe too much in benchmarks, as they often test only minimal parts of a system and don’t represent the performance of the whole system. This is especially true when a system has many moving parts, and in Zotonic the caching system and the handling of common access patterns are an integral part of the design.</p>

<h3 id="a-simplified-benchmark">A Simplified Benchmark</h3>

<p>What a benchmark <em>might do</em> is show where you could optimize the system first.</p>

<p>With this in mind we benchmarked Zotonic using the TechEmpower <span class="caps">JSON</span> benchmark, which is basically testing the request dispatcher, <span class="caps">JSON</span> encoder, <span class="caps">HTTP</span> request handling and the <span class="caps">TCP</span>/<span class="caps">IP</span> stack.</p>

<p>The benchmark was performed on an Intel i7 quad core M620 @ 2.67 GHz. The command was <code>wrk -c 3000 -t 3000 http://localhost:8080/json</code>. The results are shown in Table 9.1.</p>

<table>
<caption><b>Table 9.1</b> - Benchmark results</caption>
<thead>
<tr class="header">
<th align="left">Platform</th>
<th align="left">Requests/sec (×1000)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Node.js</td>
<td align="left">27</td>
</tr>
<tr class="even">
<td align="left">Cowboy (Erlang)</td>
<td align="left">31</td>
</tr>
<tr class="odd">
<td align="left">Elli (Erlang)</td>
<td align="left">38</td>
</tr>
<tr class="even">
<td align="left">Zotonic</td>
<td align="left">5.5</td>
</tr>
<tr class="odd">
<td align="left">Zotonic w/o access log</td>
<td align="left">7.5</td>
</tr>
<tr class="even">
<td align="left">Zotonic w/o access log, with dispatcher pool</td>
<td align="left">8.5</td>
</tr>
</tbody>
</table>

<p>Zotonic’s dynamic dispatcher and <span class="caps">HTTP</span> protocol abstraction give lower scores in such a micro benchmark. The causes are relatively easy to address, and the solutions were already planned:</p>

<ul>
<li>Replace the standard Webmachine logger with a more efficient one</li>
<li>Compile the dispatch rules in an Erlang module (instead of a single process interpreting the dispatch rule list)</li>
<li>Replace the Mochiweb <span class="caps">HTTP</span> handler with the Elli <span class="caps">HTTP</span> handler</li>
<li>Use byte arrays in Webmachine instead of the current character lists</li>
</ul>

<h3 id="real-life-performance">Real-Life Performance</h3>

<p>For the 2013 abdication of the Dutch queen and subsequent inauguration of the new Dutch king a national voting site was built using Zotonic. The client requested 100% availability and high performance, being able to handle 100,000 votes per hour.</p>

<p>The solution was a system with four virtual servers, each with 2 <span class="caps">GB</span> <span class="caps">RAM</span> and running their own independent Zotonic system. Three nodes handled voting, one node was for administration. All nodes were independent, but the voting nodes shared every vote with at least two other nodes, so no vote would be lost if a node crashed.</p>

<p>A single vote gave ~30 <span class="caps">HTTP</span> requests for dynamic <span class="caps">HTML</span> (in multiple languages), Ajax, and static assets like <span class="caps">CSS</span> and JavaScript. Multiple requests were needed for selecting the three projects to vote on and filling in the details of the voter.</p>

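<p>When tested, we easily met the customer’s requirements without pushing the system to the max. The voting simulation was stopped at 500,000 complete voting procedures per hour, five times the requested rate; at ~30 <span class="caps">HTTP</span> requests per vote that is roughly 4,000 requests per second. The simulation used around 400 Mbps of bandwidth, and 99% of request handling times were below 200 milliseconds.</p>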

<p>From the above it is clear that Zotonic can handle popular dynamic web sites. On real hardware we have observed much higher performance, especially for the underlying I/O and database performance.</p>

<h2 id="conclusion">Conclusion</h2>

<p>When building a content management system or framework it is important to take the full stack of your application into consideration, from the web server, the request handling system, the caching systems, down to the database system. All parts must work well together for good performance.</p>

<p>Much performance can be gained by preprocessing data. An example of preprocessing is pre-escaping and sanitizing data before storing it into the database.</p>

<p>Caching hot data is a good strategy for web sites with a clear set of popular pages followed by a long tail of less popular pages. Placing this cache in the same memory space as the request handling code gives a clear edge over using separate caching servers, both in speed and simplicity.</p>

<p>Another optimization for handling sudden bursts in popularity is to dynamically match similar requests and process them once for the same result. When this is well implemented, a proxy can be avoided and all <span class="caps">HTML</span> pages generated dynamically.</p>

<p>Erlang is a great match for building dynamic web based systems due to its lightweight multiprocessing, failure handling, and memory management.</p>

<p>Using Erlang, Zotonic makes it possible to build a very competent and well-performing content management system and framework without needing separate web servers, caching proxies, memcache servers, or e-mail handlers. This greatly simplifies system management tasks.</p>

<p>On current hardware a single Zotonic server can handle thousands of dynamic page requests per second, thus easily serving the vast majority of web sites on the World Wide Web.</p>

<p>Using Erlang, Zotonic is prepared for the future of multi-core systems with dozens of cores and many gigabytes of memory.</p>

<h2 id="acknowledgements">Acknowledgements</h2>

<p>The authors would like to thank Michiel Klønhammer (Maximonster Interactive Things), Andreas Stenius, Maas-Maarten Zeeman and Atilla Erdődi.</p>

<div class="footnotes">
<ol>
<li id="fn1"><p>However, it is possible to put another web server in front, for example when other web systems are running on the same server. But for normal cases, this is not needed. It is interesting that a typical optimisation that other frameworks use is to put a caching web server such as Varnish in front of their application server for serving static files, but for Zotonic this does not speed up those requests significantly, as Zotonic also caches static files in memory.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>Note that Zotonic does not set an ETag. Some browsers check the ETag for every use of the file by making a request to the server, which defeats the whole idea of caching and making fewer requests.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>A byte array, or binary, is a native Erlang data type. If it is smaller than 64 bytes it is copied between processes; larger ones are shared between processes. Erlang also shares parts of byte arrays between processes by passing references to those parts instead of copying the data itself, thus making these byte arrays an efficient and easy to use data type.<a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>See "Latency Numbers Every Programmer Should Know" at <code>http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html</code>.<a href="#fnref4">↩</a></p></li>
<li id="fn5"><p>In addition to these mechanisms, the database server performs some in-memory caching, but that is not within the scope of this chapter.<a href="#fnref5">↩</a></p></li>
<li id="fn6"><p>See <code>http://www.erlang.org/doc/efficiency_guide/advanced.html#id68921</code><a href="#fnref6">↩</a></p></li>
<li id="fn7"><p>Erlang can also <em>share</em> parts of a byte array with references to those parts, thus circumventing the need to copy that data. An insert into a byte array can be represented by an io-list of three parts: a reference to the unchanged head bytes, the inserted value, and a reference to the unchanged tail bytes.<a href="#fnref7">↩</a></p></li>
</ol>
</div>
  </body>
</html>
